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It has been reported that the number of transcription factors en- 
coded in prokaryotic genomes scales approximately quadratically 
with their total number of genes. We propose a conceptual expla- 
nation of this finding and illustrate it using a simple model in which 
metabolic and regulatory networks of prokaryotes are shaped by hor- 
izontal gene transfer of co-regulated metabolic pathways. Adapting 
to a new environmental condition monitored by a new transcription 
factor (e.g. learning to utilize another nutrient) involves both ac- 
quiring new enzymes as well as reusing some of the enzymes already 
encoded in the genome. As the repertoire of enzymes of an organ- 
ism (its toolbox) grows larger, it can reuse its enzyme tools more 
often, and thus needs to get fewer new ones to master each new 
task. From this observation it logically follows that the number of 
functional tasks and their regulators increases faster than linearly 
with the total number of genes encoding enzymes. Genomes can 
also shrink e.g. due to a loss of a nutrient from the environment 
followed by deletion of its regulator and all enzymes that become 
redundant. We propose several simple models of network evolution 
elaborating on this toolbox argument and reproducing the empiri- 
cally observed quadratic scaling. The distribution of lengths of co- 
regulated pathways in our model quantitatively agrees with that of 
the real-life metabolic network of E. coli. Furthermore, our model 
provides a qualitative explanation for broad distributions of regulon 
sizes in prokaryotes. 

Horizontal Gene Transfer, Transcriptional regulatory networks, Functional genome 
analysis 

Abbreviations: HGT, Horizontal Gene Transfer; KEGG, Kyoto Encyclopedia of Genes 
and Genomes; TF, Transcription Factor 

Introduction 

Biological functioning of a living cell involves coordinated activ- 
ity of its metabolic and regulatory networks. While the metabolic 
network specifies which biochemical reactions the cell is in princi- 
ple able to carry out, its actual operation in a given environment is 
orchestrated by the transcription regulatory network through up- or 
down-regulation of enzyme levels. A large size of the interface be- 
tween these two networks in prokaryotes is indicated by the fact that 
nearly half of transcription factors in E.coli have a binding site for 
a small molecule (TJ, which implicates them [2| as potential regula- 
tors of metabolic pathways. This interface is further increased when 
one takes into account two component systems whose sensors bind to 
small molecules and only then activate a dedicated transcription fac- 
tor. Thus, at least in prokaryotes, regulation of metabolism occupies 
the majority of all transcription factors. 

Two recent empirical observations shed additional light on evo- 
lutionary processes shaping these two networks: 



genes, many local regulators controlling several targets each, and 
all regulon sizes in-between these two extremes. 

A simple evolutionary model explains both these empirical ob- 
servations in a unified framework based on modular functional design 
of prokaryotic metabolic networks and their regulation. 

A toolbox view of metabolic networks.. Metabolic networks 
are composed of many semi-autonomous functional modules corre- 
sponding to traditional metabolic pathways 1 8 1 or their subunits (9))- 
Constituent genes of such evolutionary modules tend to co-occur (be 
either all present or all absent) in genomes [10 9|. These pathways 
overlap with each other to form branched, interconnected metabolic 
networks. Many of these pathways/branches include a dedicated 
transcription factor turning them on under appropriate environmental 
conditions. In prokaryotic organisms there is a strong positive corre- 
lation between the number of protein-coding genes in their genomes, 
the number of metabolic pathways formed by these genes, the num- 
ber of transcription factors regulating these pathways, and, finally, 
the number of environments or conditions that organism is adapted 
to live in. 

We propose to view the repertoire of metabolic enzymes of an 
organism as its toolbox. Each metabolic pathway is then a collection 
of tools (enzymes), which enables the organism to utilize a partic- 
ular metabolite by progressively breaking it down to simpler com- 
ponents, or, alternatively, to synthesize a more complex metabolite 
from simpler ingredients. Adapting to a new environmental condi- 
tion e.g. learning to metabolize a new nutrient, involves acquiring 
some new tools as well as reusing some of the tools/enzymes that are 
already encoded in the genome. From this analogy it is clear that as 
the toolbox of an organism grows larger, on average, it needs to ac- 
quire fewer and fewer new tools to master each new metabolic task. 
This is because the larger is the toolbox the more likely it is to already 
contain some of the tools necessary for the new function. Therefore, 
the number of proteins encoded in organism's genome (i.e. the size 
of its toolbox) is expected to increase slower than linearly with the 
number of metabolic tasks it can accomplish. Or, conversely, the 
number of nutrients an organism can utilize via distinct metabolic 
pathways is expected to scale faster than linearly with its number of 
enzymes or reactions in its metabolic network. This last prediction 
is empirically confirmed by the data in the KEGG database (8): as 
shown in Fig. S6 in supplementary materials the best powerlaw fit 
to the number of metabolic pathways vs the number of metabolic re- 
actions in prokaryotic genomes has the exponent 2.2 ± 0.2. This is 
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• The number of transcriptional regulators is shown to grow faster 
than linearly 1 3 4 5 6 1 (approximately quadratically |4 1) with the 
total number of proteins encoded in a prokaryotic genome. 

• The distribution of sizes of co-regulated pathways (regulons), 
which in network language correspond to out-degrees of tran- 
scription factors in the regulatory network, has long tails Q- 
As a result the set of transcription factors of each organism 
includes few global ("hub") regulators controlling hundreds of 
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in agreement with quadratic scaling of the number of transcription 
factors |4| if one assumes that most of these pathways are regulated 
by a dedicated transcription factor. 



Results 

Evolution of networks by random removal and addition of 
pathways. We propose a simple model of evolution of metabolic and 
regulatory networks based on this toolbox viewpoint. The metabolic 
network of a given organism constitutes a subset of the "universal 
biochemistry" network, formed by the union of all metabolites and 
metabolic reactions taking place in any organism. An approxima- 
tion to this universal biochemistry can be obtained by combining all 
currently known metabolic reactions in the KEGG database (8). For 
prokaryotes, entire metabolic pathways from this universal network 
could be added all at once by the virtue of Horizontal Gene Transfer 
(HGT), which according to Ref. 1111 is the dominant form of evolu- 
tion of bacterial metabolic networks. Recent studies 1121 reported a 
number of HGT "highways" or preferential directions of horizontal 
gene transfer between major divisions of prokaryotes. As a result of 
these and other constraints the effective size of the universal network 
from which an organism gets most new pathways is likely to devi- 
ate from the simple union of reactions in all organisms. Metabolic 
networks can also shrink due to removal of pathways. This often 
happens when a nutrient disappears from the environment of an or- 
ganism over an evolutionary significant time interval (see "use it or 
loose it" principle by Savageau [ 13 1). A massive elimination of path- 
ways occurs e.g when an organism becomes obligate parasite fully 
relying on its host for "pre-processing" of most nutrients. 

The state-of-the-art information on metabolic networks is not ad- 
equate for a fully realistic modeling of their evolution. Fortunately, 
faster-than-linear scaling of the number of pathways and their regu- 
lators with the number of genes is the robust outcome of the toolbox 
evolution scenario and as such it is not particularly sensitive to topo- 
logical structure of the universal biochemistry network. In particular 
we found (see Fig. SI) essentially identical scaling in two models us- 
ing two very different variants of the universal biochemistry network: 

• the union of KEGG reactions |8| in all organisms. The part 
of this network connected to the biomass production consists of 
N un i V ~ 1800 metabolites; 

• a random spanning tree on the fully connected graph of N un i V 
metabolites. While certainly not realistic, this version is mathe- 
matically tractable. 

Furthermore, it turned out that many other details of pathway acqui- 
sition process do not change scaling exponents of our model (see 
Fig. S2 in Supplementary materials). In the rest of this study we 
use the first universal network (union of all KEGG reactions) in our 
numerical simulations of the model and the second network in our 
mathematical analysis. 

While toolbox view of evolution is equally applicable to 
catabolic (breakdown of nutrients) and anabolic (synthesis of com- 
plex metabolites) pathways, for simplicity we will simulate only ad- 
dition and removal of catabolic branches. Given the repertoire of 
enzymes of an organism each of the N un i v universal metabolites can 
be categorized as either "metabolizable" (connected to biomass pro- 
duction), or "non-metabolizable" (currently outside of the metabolic 
network). To add a new branch to the network in our model we first 
randomly choose a non-metabolizable molecule as a new nutrient 
(leaf). A pathway/branch that begins at the leaf and connects it to 
the set of metabolizable molecules is then added to the network. This 
connecting pathway consists of a linear chain of reactions randomly 
selected from the universal network until it first intersects with the 
already existing metabolic network of the organism. The leaf plus all 
the intermediate metabolites of this branch thereby become metabo- 
lizable. This process is illustrated in Fig. [IK. 



In our model pathway additions and removals are treated in a 
symmetric fashion. The steps leading to pathway deletion are illus- 
trated in Fig. [TJ3. First, one of the leaves of the network correspond- 
ing to a vanished nutrient is chosen randomly. The branch starting 
at this nutrient/leaf is followed downstream to the point where it first 
intersects another branch of the network. This entire path, starting 
from the leaf down to the merging point with another pathway is then 
removed from the network. The selected nutrient along with all in- 
termediate metabolites thereby become non-metabolizable. 

The network in our model evolves by a random sequence of path- 
way additions and removals (see Methods for more details). Since 
our goal is to understand how properties of metabolic and regulatory 
networks scale with the genome size of an organism, we take multi- 
ple snapshots of the evolving network with different values of N m et 
- the current number of nodes in the metabolic network, which in our 
model is equal to the number of reactions or metabolic enzymes. 

Assigning transcriptional regulators to metabolic pathways. 

Operation of metabolic networks involves regulating production of 
enzymes in response to nutrient availability. In prokaryotes most of 
this regulation is achieved at the transcriptional level. In order to in- 
vestigate the interface between metabolic and regulatory networks 
we extend our model to include transcription factors (TFs) which 
are activated by nutrient availability to turn on or off the enzymes 
in individual metabolic pathways. In the basic version of our model 
shown in Fig. |2]\ we chose the following simple method to assign 
TFs to reactions: one randomly picks a leaf/nutrient and follows its 
reactions downstream until this branch either reaches the metabolic 
core or merges with a pathway regulated by a previously assigned 
TF. A new TF is then assigned to regulate all reactions in this part 
of the nutrient utilization pathway. This process is repeated until all 
enzymes/reactions have been assigned a (unique) transcriptional reg- 
ulator (see Fig. |2]\). Each TF is activated by the presence of the cor- 
responding nutrient in the environment. Note that this method results 
in exactly one TF per nutrient, and that the out-degree distribution of 
TFs in the regulatory network is identical to the distribution of branch 
lengths in the metabolic network. 

In addition to this simple regulatory network architecture we 
have tried several others illustrated in Figs. [2J3-D. The advantage of 
these more complicated schemes is that they ensure that on/off states 
of connected metabolic pathways are properly coordinated with each 
other. For example, unlike in Fig. [2]\, in Figs. [2J3-D the red tran- 
scription factor (TF2) turns on the downstream (and only the down- 
stream) part of the blue pathway necessary for utilization of the red 
nutrient. We will further compare network topologies generated by 
these rules in the Discussion section. 

Comparison of the model with empirical data. In agreement 
with the toolbox argument outlined in the introduction, we found (see 
Fig. E}\) that the number of transcriptional regulators of an organism 
scales steeper than linearly with the total number of metabolites in 
its metabolic network, which in our model is equal to its number of 
reactions or enzymes: 

N TF oc (N met ) a . [1] 

The best fit has a = 1.8 ± 0.2. In Fig. |4]\ we directly compare nu- 
merical simulations of the toolbox model (red diamonds) to the em- 
pirical scaling of the number of transcription factors with the number 
of genes in all currently sequenced prokaryotic genomes (green cir- 
cles). To approximate the total number of genes N genes in the whole 
genome we multiplied the number of metabolites/reactions N m( >t by 
a constant factor. The empirical value of the ratio N me t/N genes ~ 
0.2 was estimated as follows: metabolic enzymes constitute about a 
quarter of all genes in a procaryotic genome independent of its size 
(see blue line in Fig. la of |4|). Due to presence of isoenzymes the 
number of different reactions catalyzed by these enzymes (equal to 
the number of metabolites N met in our model) is somewhat smaller 
and its average value over 45 1 fully sequenced prokaryotic genomes 
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1141 is 20%. The model results shown in Fig. [4] were simulated 
on the universal network formed by the union of KEGG reactions 
in all organisms. However, a model simulated on a random univer- 
sal network of the same size Nuniv — 1800 produced essentially 
identical results (black crosses in Fig. SI). This agreement indicates 
that the scaling between Ntf and N me t for the most part is deter- 
mined by just the number of universal metabolites - N un i V , and is not 
particularly sensitive to the topology of connections between them. 
On the other hand, we believe that nearly precise agreement of the 
actual number of regulators in real prokaryotic genomes and in the 
model is coincidental. Indeed, even in prokaryotes not all transcrip- 
tion factors are dedicated to regulation of metabolic enzymes. This 
means that to represent all TFs in the whole genome the number of 
metabolic transcription factors in our model has to be multiplied by a 
currently unknown number. Furthermore, as discussed in the begin- 
ning of the Results section the effective size of the universal network 
for real-life horizontal transfer of metabolic pathways is likely to be 
different from the union of all reactions currently listed in KEGG. 
We still believe that the KEGG-based universal network provides a 
correct order-of-magnitude estimate of N univ . Hence, the approxi- 
mate agreement between Ntf vs Ngenes plots in our model and real 
prokaryotic genomes is an encouraging sign. 

In addition to providing an explanation to the quadratic scaling 
between numbers of leaves and all nodes, our model nicely repro- 
duces the large-scale topological structure of real-life metabolic net- 
works. An example of a metabolic network generated by the toolbox 
model is shown in Fig. [3j$. Its tree-like topology reflects our sim- 
plification that each reaction converts a single substrate to a single 
product. The network is hierarchical in the sense that smaller linear 
pathways tend to be attached to progressively longer and longer path- 
ways, until they finally reach the metabolic core. This architecture 
is reminiscent of drainage networks in which many short tributaries 
merge to give rise to larger rivers. For comparison, in Fig. [3]\ we 
show a tree-like backbone (to match linear pathways in our model) 
of the E. coli metabolic network [8, 14 1 of approximately the same 
size as the model network in Fig. [5Ji. The details of generating this 
backbone are described in the Methods section. The overall topo- 
logical structure of real and model networks clearly resemble each 
other. 

To better quantify this visual comparison in Fig. [4jS we compare 
cumulative branch length distributions P(K ut > K) in our model 
with N m et — 400 (red diamonds for N univ — 1800 and red squares 
for Nuniv = 900) and in real metabolic network in E. coli of com- 
parable size (green circles). All three distributions are characterized 
by a long powerlaw tail: P(K ou t) ~ K~^ t . Best fit value of the ex- 
ponent 7 = 2.9 ± 0.2 is similar in model and real-life networks and 
agrees with our analytical result 7 = 3 derived in the next section. 
Furthermore, the data in our model simulated on a truncated univer- 
sal network with N univ = 900 (red crosses in Fig. [4ji calculated 
for the red network in Fig. |3j5) are in excellent agreement with their 
real-life conuterpart in E. coli (green circles in Fig. [4j3 calculated for 
the green network in Fig. |3]\) throughout the whole range. 

In Fig. S3 we compare distributions of regulon sizes (branch 
lengths) in our model (red diamonds in Fig. |4ji) and in the Regulon 
database |15| including all presently known transcriptional regula- 
tions in E. coli. One can immediately see that the tail of the distribu- 
tion in the Regulon database with the exponent ~ 2 is considerably 
broader than in our model. There are several possible explanations of 
this discrepancy: 1) coordination of activity of different metabolic 
pathways with each other as shown in Figs. [2fS-D inevitably in- 
creases out-degree of transcription factors and gives rise to larger 
regulatory hubs; 2) regulation of proteins other than metabolic en- 
zymes in the same regulon; 3) an anthropogenic effect in which bet- 
ter studied transcription factors included in the regulon database have 
larger-than-average out-degrees. In the Discussion section we return 
to comparison real-life and model regulatory networks in more de- 
tails. 



Mathematical derivation of scaling behavior in toolbox 
model. When a new nutrient (leaf) is added to a network of size 
N met , the length of the metabolic pathway required for its utilization 
is (on average) inversely proportional to Nmet- This result is easy to 
show for a mean-field version of the model on a randomly generated 
universal network. In this case each reaction in the new pathway has 
the same probability p — N m et / N un iv to end in one of the N m et 
currently metabolizable molecules. The minimal pathway required 
for utilization of the new nutrient involves only the reactions until the 
first intersection with the already existing metabolic network. The 
average length of such pathway is just the inverse of this probability: 
l/p = Nuniv /Nmet- When this pathway is added, the number of 
metabolizable molecules increases by AN m et ~ Nuniv I Nmet and 
the number of regulators increases by one: ANtf = 1. In the steady 
state of the model, removal of a branch produces the opposite result: 
AN met — — Nuniv / N m et , ANtf = — 1. In both cases one has: 



dNmet N u mv 



dN T F 

the integration of which gives 

Ntf = 



N„ 



t* 1 • umv 



[21 



[3] 



Therefore, the quadratic scaling between Ntf and N m et naturally 
emerges from our toolbox model. 

Similar calculations allow one to derive the scale-free distribu- 
tion of branch lengths (regulon sizes) in our model: 



N{K ou t 



k: 



with 



7 ; 



[4] 



Indeed, the expected length of a newly added metabolic pathway (or 
the out-degree of its regulator in transcription regulatory network 
shown in Fig. [2}\) is K ou t = N nn iv/Nmet- As the size of the 
metabolic network increases, the length of each new pathway pro- 
gressively shrinks. If the network was monotonically growing, longer 
pathways of length K ou t > K were added at the time when the num- 
ber of metabolites was smaller than N un iv/K or equivalently the 
number of transcription factors was below N U niv/{2K 2 ). There- 
fore, P(K ou t >K) = Numv/(2K 2 )/N T F or P(K ou t = K) ~ 
N U mv/(N T FK 3 ) so that 7 = 3 in Eq. g] As evident from Fig. 
[4j3 , random cycling through addition and removal of pathways in the 
steady state of our model does not significantly change this exponent 
with best fit value of7 = 2.9±0.2 shown as solid line in Fig. |4j3. 



Discussion 

Trends of average in- and out-degrees in the regulatory 
network as a function of genome size. As was pointed out by 
van Nimwegen 1 4] 1161 [171 faster- than-linear scaling of the number 
of transcription factors generates systematic differences in topol- 
ogy of transcriptional regulatory networks as a function of genome 
size. Indeed, the total number of regulatory interactions (pairs of 
TFs and their target genes) in a given organism can be written ei- 
ther as Ngenes {Ki n ) if one counts the incoming regulatory inputs of 
all genes, or as NTF{K ou t) if one counts the regulatory outputs of 
all transcription factors. Here the brackets denote the average over a 
given genome. Therefore, one always has 



N a , 



(Kin) 
{Kout) 



[5] 



The empirical data 1 3 . 4 1 indicate that the left hand side of this equa- 
tion monotonically grows with genome size and is roughly propor- 
tional to Ngenes ■ Therefore, an increase in the number of genes in 
larger genomes must be accompanied either by an increase in average 
in-degree (Ki n ) of all genes or by a decrease in average out-degree 
{K ut) of transcriptional regulators. The latter trend is indirectly 
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supported by the empirical observation 1 16 1 that the average operon 
size (a lower bound on regulon size) is negatively correlated with 
N genes ■ This trend also exists in our basic model (Fig. [2f\) in which 
Kout of transcription factors regulating newly added metabolic path- 
ways progressively decreases with N me t ~ N genes . Furthermore, 
another recent study [17] found no systematic correlation between 
{Ki n ) and N genes . This is the case in our model in Fig. |2]\ where 
all enzymes representing vast majority of all proteins in our model 
have the same Ki„ — 1 independently of genome size. However, 
such complete lack of coordination between different metabolic path- 
ways is not realistic. To correct this we explored several other reg- 
ulatory network architectures illustrated in Figs. [2?-D. In all these 
models enzymes are regulated by more than one transcription fac- 
tor. Transcription factors in the model in Fig. [2jS ensure a complete 
top-to-bottom regulation of the entire pathway for utilization of each 
nutrient. In this case centrally positioned metabolites have unreal- 
istically large in-degrees. Opposite to the basic model in Fig. [2]\, 
the average in-degree (K in ) in Fig. [2}} increases with N genes , while 
{K ou t) remains constant. Real-life regulatory networks are likely to 
be somewhere in-between these two extreme scenarios illustrated in 
Figs.[2h|2fe. 

Coordination of activity of upstream and downstream 
metabolic pathways. Converting a nutrient into biomass of an or- 
ganism often involves several successive pathways each regulated by 
its own transcription factor. States of activity of such pathways have 
to be coordinated with each other. Our basic model illustrated in Fig. 
|2]\ does not involve such coordination. In this model: 

• Transcription factors do not regulate other transcription factors. 
This results in "shallow" transcriptional regulatory networks con- 
sisting of only two hierarchical layers: the upper level including 
all regulators, and the lower level including all workhorse pro- 
teins (metabolic enzymes). While this assumption in its pure form 
is certainly unrealistic, it approximates the hierarchical structure 
of real prokaryotic regulatory networks, which were shown to be 
relatively shallow |7 18. 19 1. That is to say, the number of hi- 
erarchical layers in these networks was shown to be smaller than 
expected by pure chance 1 191. 

• In the regulatory network shown in Fig. [2]\ every enzyme is reg- 
ulated by precisely one transcription factor. Once again this fea- 
ture, while obviously unrealistic, approximates topological prop- 
erties of real-life regulatory networks e.g one in E. coli. In (7) it 
was shown that in this network the in-degree distribution peaks 
at one regulatory input per protein beyond which it rapidly (ex- 
ponentially) decays. This should be contrasted with a broad out- 
degree (regulon size) distribution |7| which has a long power-law 
tail reaching as high as hundreds of targets. 

Several possible regulatory network architectures ensuring nec- 
essary coordination of activity of upstream and downstream path- 
ways are shown in Figs. |2ji-D. Models shown in Fig. [2f -D solve 
the coordination problem by adding regulatory interactions among 
transcription factors. The positive regulation TF2 — > TF1 in Fig. [2f 
ensures that the nutrient processed by the red pathway would be con- 
verted to the central metabolism (dark green area) by the downstream 
part of the blue pathwa)Q. One problem with adding the TF2 — )■ TF1 
regulation is that it stimulates some unnecessary enzyme production. 
Indeed, the presence of the red nutrient triggers the production of en- 
zymes of the entire blue pathway including those located upstream 
of the merging point with the red pathway which are not required 
for red nutrient utilization. To eliminate this waste of resources we 
added negative regulations of these upstream enzymes by TF2 (see 
Fig. [2f). Other architectures shown in Fig. |2jj and|2jD instead of 
suppressing the upstream enzymes of the blue pathway exclusively 
activate its downstream enzymes. In Fig. [2j3 transcription factors 
regulate the entire length of the long path from every leaf (nutrient) 
all the way down to central metabolism. Another option illustrated 



in Fig. [2jD is to add a new transcription factor (TF3) activated by the 
TF2 to regulate only the downstream part of the blue pathway. Even 
though the number of transcription factors in Fig. |2jD is up to two 
times larger than the number of leaves in the metabolic network, we 
have verified that their quadratic scaling remains unchanged. 

Transcription regulatory networks are also characterized by a 
large number of feed-forward loops 1 18 1. It has been also conjectured 
1 1 8 1 that some of them serve as low-pass filters buffering against tran- 
sient fluctuations in nutrient availability. Such loops could be easily 
incorporated in our models. One possibility would be to add regula- 
tory interaction between TF2 and TF1 in Fig. [2j3. For the model in 
Fig. |2jD one might extend the range of TF2 to include at least part of 
the targets of TF3 and/or add a regulatory interaction between TF1 
and TF2. Our simulations of models in Fig. [2}3-D indicate that they 
all give rise to very long regulons. The distribution of regulon sizes 
of these models shown in Fig. S4 has a tail significantly broader than 
the one empirically observed in E. coli 1 15 1. A detailed study of reg- 
ulatory network architectures used by real-life prokaryotes to ensure 
coordination of activity of their metabolic pathways goes beyond the 
scope of this study and will be addressed in our future research. 

Prokaryotic genomes are shaped by horizontal gene transfer 
and prompt removal of redundant genes. The Horizontal Gene 
Transfer (HGT) of whole modules of functionally related genes from 
other organisms is the likely mechanism by which new pathways are 
added to the metabolic network in our model. Indeed, the rules of 
our model imply that an organism acquires several enzymes neces- 
sary to utilize a new nutrient not one by one but all together. Indeed, 
a pathway converting a nutrient to a downstream product that is dis- 
connected from the rest of the metabolic network does not contribute 
to biomass production and thus confers no evolutionary advantage 
to the organism. The dominant role of HGT in shaping contents 
of prokaryotic genomes in general and their metabolic networks in 
particular is well documented |21 1. For example, a recent empirical 
study 1 1 1 1 reports that horizontally transferred enzymes 

• Outnumber duplicated enzymes during the last 100 million years 
in evolution of E. coli. 

• Frequently confer condition-specific advantages, facilitating 
adaptation to new environments. As a consequence, horizontally- 
transferred pathways tend to be located at the periphery of the 
metabolic network rather than near its core. 

• tend to come in functionally-coupled groups (see also (9) for a 
genome-wide analysis of this trend). 

These empirical observations make the central assumptions of our 
model all the more plausible. Another feature of evolution of 
prokaryotic genomes used in our model is their tendency to promptly 
remove redundant genes. Indeed, in our model we implicitly assume 
that if a set of horizontally transferred genes contains some enzymes 
that are already encoded in the genome, these redundant copies are 
promptly removed. Stopping the added metabolic branch precisely 
at the intersection point with the existing metabolic network corre- 
sponds to instantaneous removal of these redundant genes. We ver- 
ified that this simplification could be relaxed without changing scal- 
ing exponents of the model . This is demonstrated in Fig. S2 in 
supplementary materials where we simulated a version of our model 
assuming more realistic finite rate of removal of redundant genes. 

Both these features (massive horizontal gene transfers and 
prompt removal of redundant genes) are not characteristic of eukary- 
otic genomes in general, and those of multicellular organisms in par- 
ticular. That is consistent with our finding of approximately linear 



'Note that in biosynthetic (anabolic) pathways the direction of metabolic flow is opposite to that 
in a nutrient-utilization (catabolic) pathways used in our illustrations (Fig. IzlX-D). As a result, the 
direction of regulatory interactions between transcription factors should be reversed as well. Thus 
in biosynthetic pathways one expects more centrally-positioned regulator with larger out-degree to 
regulate its more peripheral (and less connected) counterparts as is known to be the case e.g. in 
the leucine biosynthetic pathway (see 20 and references therein) 
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scaling of Ntf with N gen(is in genomes of animals (see Fig. S5 
where the best fit exponent 1.15 ±0.2). The best fit exponent for all 
eukaryotic genomes (1.3 ± 0.2 |4|) is marginally higher and is still 
much lower than its value in prokaryotes (2.0 ± 0.2). 

Several earlier modeling efforts (4] [22] 1231 explained the 
quadratic scaling in terms of gene duplications followed by diver- 
gence of the resulting paralogs. Models of this type assume that ad- 
ditions and deletions of individual genes are to a large degree decou- 
pled from their biological function. Conversely, our model is, to the 
best of our knowledge, the first attempt to explain this scaling rela- 
tion in purely functional terms. Instead of single genes we add and 
delete larger functional units (metabolic pathways) and assume that 
they are retained by evolution only if they positively contribute to the 
functioning of the organism, that is to say if they get connected to its 
biomass production through the existing metabolic network. Also, 
contrary to earlier explanations |4 22, 23 1, our toolbox model relies 
on a different evolutionary mechanism (horizontal gene transfer vs 
gene duplications) that is predominant in prokaryotes. 

How quickly do new pathways acquire transcriptional regu- 
lators?. In our model we assume that the regulatory network closely 
follows changes in the metabolic toolbox of the organism. For the 
sake of convenience in our simulations we choose to assign regula- 
tors de novo to each new state of the metabolic network. To verify 
that this simplification does not distort our final results we studied a 
variant of our model in which the transcriptional regulatory network 
dynamically follows changes in the metabolic network. The regulon 
size distribution in this model was essentially unchanged from the 
case where regulators were assigned de novo. Such nearly immedi- 
ate assignment of regulators to newly acquired pathways is supported 
by the empirical study of Price and collaborators 1 24 1 reporting that 
horizontally transferred peripheral metabolic pathways frequently in- 
clude their own transcriptional regulators. This should come as no 
surprise, given many well known cases where metabolic enzymes 
and their regulators either belong to the same operon or are located 
very close to each other on the chromosome (as e.g. the Lac repres- 
sor and the Lac operon). Our model is also consistent with the selfish 
operon theory 1 25 1 stating that genomic proximity of functionally re- 
lated genes is favored by evolution since it increases the likelihood of 
a successful horizontal transfer of a fully functional pathway. 

Overall, the emerging consensus 1 26] is that regulatory networks 
in prokaryotic genomes are flexible, quickly adaptable, and rapidly 
divergent even between closely related strains. Thus, even in cases 
when a horizontally transferred pathway does not include a dedicated 
transcriptional regulator it could nevertheless be quickly acquired in 
a separate HGT event or created by gene duplication of another TF 
in the genome. 

Materials and Methods 

Numerical simulations of the model. Metabolic network in our 
model is shaped by randomly repeating pathway addition and path- 
way removal steps. The boundary conditions for this stochas- 
tic process do not allow the number of metabolites to fall be- 
low 40 or exceed about 1600. Networks with different values of 
N me t are then sampled and analyzed. The universal network used 
in our study consists of the union of all reactions listed in the 
KEGG database |8|. The directionality of reactions and connected 
pairs of metabolites are inferred from the map version of the reac- 
tion formula: |f tp : / / ftp . genome . jp/pub/kegg/ligand/| 
|reaction/ reaction_mapf ormula ■ lst| Since our goal is to 
model the conversion of nutrients to organism's biomass we kept the 
metabolites located upstream of the central metabolism (reachable 
by a directed path from Pyruvate). This left us with 1813 metabo- 



lites connected by 2745 edges. The exact size and topological struc- 
ture of the universal network is not known. To test our model on 
a universal network of a different size (red squares in Fig. |4ji) we 
pruned the KEGG network down to ~ 900 metabolites. This pruning 
was achieved by randomly removing nodes along with branches that 
got disconnected from the central metabolism. In yet another ver- 
sion shown in Supplementary Fig. SI the universal network is made 
of random walks on the fully connected graph of N un iv = 1800 
metabolites. From this figure it follows that properties of our model 
are mainly determined by the number of nodes in the universal net- 
work and not by details of its topology. 

1) Pathway addition. One randomly chooses a new leaf (nutri- 
ent) and a self-avoiding random walk on the universal network. This 
directed walk is started at the leaf and extended until it first intersects 
the subset of N met presently metabolizable molecules. The leaf plus 
all the intermediate metabolites of this new branch thereby become 
metabolizable. 

2) Pathway deletion. One of the Ntf network leaves (nutrients) 
is chosen randomly. The links downstream from this leaf are fol- 
lowed until the first merging point of two metabolic branches. All 
the metabolites down to this merging point are removed from the 
network, thereby becoming non-metabolizable. 

We typically choose to begin all simulations with 20 nodes in the 
"metabolic core" (the dark green central circle in Figs. 1 1 12b that are 
already metabolizable. This core could be thought of as the "univer- 
sal central metabolism" present in most organisms. The number of 
these core metabolites, N cors , is the second parameter of our model. 
However, in practice, as long as iV core <C N un i V , the network topo- 
logical structure in the steady state does not depend on the value of 
iV core . In our simulations we also tried different starting sets of me- 
tabolizable molecules connected by linear branches to the core but 
inevitably arrived to the statistically identical steady-state networks. 

Sources of empirical datasets. The distribution of branch lengths 
in Fig. |2]\ was calculated as follows: first a leaf was randomly cho- 
sen and followed to the metabolic core. Subsequent branches were 
followed until the merging point with another branch that was previ- 
ously selected. In the metabolic network of the K-12 strain of E. coli 
leaves were defined as either 1) having zero in-degree (no produc- 
tion within the organism) or 2) having an undirected degree of one 
(endpoints of linear branches formed by reversible reactions). The 
backbone of the E. coli network was defined by following random 
linear paths starting at these leaves and ending at the intersection with 
each other or at the metabolic core. This left us with the network in 
Fig. [3j\ of ~ 420 metabolite nodes (including 112 leaves) located 
upstream of the central metabolism 1 8 1 . 

To estimate the number of transcription factors in differ- 
ent genomes shown in Fig. |4]\ (green symbols) we used the 
DBD database 1271 ( [www . transcriptionf actor . org) with 
its manually curated list of 147 Pfam families of transcription fac- 
tors. The resulting values of Ntf are in good agreement with those 
obtained in earlier studies [3 4 5] ED- 
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Addition of new pathway: 




Removal of pathway: 




Fig. 1. "Toolbox" rules for evolving metabolic networks in our model. A) addition of a 
new metabolic pathway (red) that is long enough to connect the red nutrient to a previously 
existing pathway (blue) which further converts it to the central metabolic core (dark green). 
B) removal of a part of the blue pathway following loss of the blue nutrient. The upstream 
portion of the blue pathway that is no longer required is removed down to the point where 
it merges with another pathway (red). The light green circle denotes all metabolites in the 
universal biochemistry network from which new pathways are drawn (see text for details). 




C) „ ? • TF2 D) • TF2 




Fig. 2. Schematic diagrams illustrating several possible regulatory network architectures for 
control of metabolic enzymes/pathways. Four panels correspond to different versions of our 
model discussed in the text. In the basic model (panel A) there is no coordination of activity 
between red and blue metabolic pathways. More realistic models (panels B-D) include extra 
regulatory interactions (purple dashed lines) and transcription factors (purple TF3 in panel D) 
ensuring that only the part of the blue pathway necessary for utilization of the red nutrient is 
turned on by the corresponding transcription factor (red TF2). 
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Fig. 3. A. The backbone of the metabolic network in E. coli 00 located upstream of central 
metabolism (green). B. A similarly-sized network generated by our model (red). Note hierarchy 
of branch lengths in both panels in which shorter pathways tend to be attached to progressively 
longer pathways. The branch length distributions in real and model networks are shown as 
green circles and red squares in Fig. [4j3. 
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branch/regulon size 

Fig. 4. A. The number of transcription factors scales approximately quadratically with the 
total number of genes in real prokaryotic genomes 8 27 (green) and our model (red) simulated 
on the KEGG universal network with N un i v = 1800. The number of metabolic reactions 
in the model was rescaled to approximate the total number of genes in a genome (see text for 
more details). Error bars correspond to data scatter in multiple simulations of the model. The 
solid line with slope 2 is the best powerlaw fit to the scaling in real prokaryotic genomes (the 
best fit to our model is 1.8 i 0.2), while the dashed line with slope 1 is shown for comparison 
to emphasize deviations from linearity. B. Cumulative distributions of pathway/branch lengths 
in the E. coli metabolic network (green circles) and our model of comparable size (red symbols) 
have similar tail exponents. The negative slope of the best powerlaw fit *y — 1 = 1.9 i 0.2 
(solid line) is consistent with our analytical result 7 = 3 (see text for details). The toolbox 
model with N me t = 400 was simulated on universal networks of KEGG reactions with 
N un i v — 1800 (red diamonds) and N un i v = 900 (red squares) nodes. 
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